Part One

DOMAIN: Automobile

CONTEXT: The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multi-valued discrete and 5 continuous attributes.

DATA DESCRIPTION: The data concerns city-cycle fuel consumption in miles per gallon

Attribute Information:

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

PROJECT OBJECTIVE: The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models on each to predict ‘mpg’.

Import and warehouse data:
• Import all the given datasets and explore shape and size. 
• Merge all datasets onto one and explore final shape and size.
• Export the final dataset and store it on local machine in .csv, .xlsx and .json format for future use.
• Import the data from above steps into python.
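The steps above can be sketched with pandas; the file location, column names and values here are illustrative assumptions, not the actual supplied files:

```python
import os
import tempfile

import pandas as pd

# Hypothetical mini-datasets standing in for the supplied files.
df1 = pd.DataFrame({"car name": ["ford pinto", "toyota corona"], "mpg": [25.0, 31.0]})
df2 = pd.DataFrame({"car name": ["ford pinto", "toyota corona"], "weight": [2046, 1773]})
print(df1.shape, df2.shape)          # explore shape of each input

# Merge all datasets into one on the shared key and explore the final shape.
merged = df1.merge(df2, on="car name")
print(merged.shape)

# Export the final dataset in .csv, .xlsx and .json format for future use.
out = os.path.join(tempfile.gettempdir(), "car_data")
merged.to_csv(out + ".csv", index=False)
merged.to_json(out + ".json", orient="records")
# merged.to_excel(out + ".xlsx", index=False)  # requires openpyxl

# Re-import the exported data for the next steps.
reloaded = pd.read_csv(out + ".csv")
```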

Data cleansing:

• Missing/incorrect value treatment
• Drop attribute/s if required using relevant functional knowledge
• Perform any other corrections/treatments needed on the data.
No missing values are observed at first glance; closer inspection, however, reveals defective values in the horsepower (hp) column, which become NaN once the column is converted to numeric.
Based on the observed dataset values, the NaNs in the hp column are replaced with the median.
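A minimal sketch of this treatment, assuming the raw file encodes missing horsepower as '?' (as in the UCI auto-mpg data); the values are illustrative:

```python
import pandas as pd

# Illustrative sample: '?' marks the defective horsepower entries.
df = pd.DataFrame({"hp": ["130", "165", "?", "150"]})

df["hp"] = pd.to_numeric(df["hp"], errors="coerce")  # '?' becomes NaN
df["hp"] = df["hp"].fillna(df["hp"].median())        # impute with the median
```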
Data analysis & visualisation
• Perform detailed statistical analysis on the data.
• Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis. 
Hint: Use your best analytical approach. You can even mix and match columns to create new columns.

observation:

mpg: 
    The mean and median are almost equal, so the distribution appears to be normal; we can confirm this by plotting a distribution plot.
    75% of the observed vehicles have a fuel efficiency of 29 miles per gallon or less.
    There are possible outliers, since the maximum value differs considerably from the 75th percentile.

cyl:
    The minimum number of cylinders is 3.
    The maximum number of cylinders is 8.

disp:
    Engine displacement appears to be positively skewed (right skewed), since the mean is greater than the median.
    75% of the observed vehicles have an engine displacement of 262 cubic inches or less.

hp:
    Horsepower appears to be positively skewed (right skewed), since the mean is greater than the median.
    There are possible outliers in the data.

wt:
    The mean and median are almost equal; the weight column appears to be normally distributed.

acc:
    The mean and median are almost equal; the acceleration column appears to be normally distributed.
Let's check for outliers in the data.

Observation:

mpg and cyl are negatively correlated: as the number of cylinders grows, the mpg of the vehicle decreases.
mpg and disp are negatively correlated: as displacement increases, mpg decreases.
mpg and hp are negatively correlated: as horsepower increases, mpg decreases.
mpg and wt are negatively correlated: as weight increases, mpg decreases.
cyl and disp are positively correlated: as the number of cylinders increases, vehicles tend to have higher displacement.
cyl and hp are positively correlated: as the number of cylinders increases, vehicles tend to have higher horsepower.
cyl and wt are positively correlated: heavier vehicles tend to have more cylinders.
disp and hp are positively correlated: more displacement (cubic inches) means more horsepower.
disp and wt are positively correlated: heavier vehicles tend to have larger engine displacement.
acceleration has a negative correlation with cyl, disp, hp and wt.

Observation:

Most cars have MPG between 12 and 20; the distribution is roughly normal but a bit right skewed.
Even numbers of cylinders are more common.
Engine displacement follows a right-skewed distribution; most cars have displacement between 80 and 160.
Most cars have horsepower between 50 and 100; the data is right skewed.
Vehicle weight (lbs.) is right skewed; most cars weigh between 2000 and 3000.
Time to accelerate from 0 to 60 mph (sec.) follows a normal distribution.
Model year (modulo 100) shows that an almost equal number of cars was released to the market each year.
Origin of car (1. American, 2. European, 3. Japanese): most cars are of American origin.
Outliers are observed in the MPG (mpg), horsepower (hp) and time-to-accelerate (acc) columns.
The above graph shows that the variables are dependent on one another.

By observation, most variables have either a positive or a negative correlation with one another, so PCA can be applied to reduce the number of columns before training the machine learning models.

The origin column distribution also shows that the data overlap for mpg, acc and yr, while disp, hp, wt and dispercyl may contribute to different clusters in the data. So the inclusion of this column needs to be tried out to decide whether it is needed for model building.

When we look at the origin column, it looks like the data were combined from 3 disparate sources.

The acc column has a negative correlation with cyl, disp, hp and wt. We can deduce a few relationships from this, e.g. the heavier the car, the slower it tends to accelerate.
acc vs mpg shows a cloud on the scatter plot, which does not indicate a strong relationship, but there is a tail suggesting a possible positive correlation.

The new column dispercyl is the disp column divided by the cyl column, signifying engine displacement per cylinder. When this column is plotted against the other columns, we can see at least 3 clusters forming.
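Deriving this column is a one-liner in pandas (the values below are illustrative):

```python
import pandas as pd

# Illustrative values for displacement and cylinder count.
df = pd.DataFrame({"disp": [307.0, 97.0, 400.0], "cyl": [8, 4, 8]})

# Engine displacement per cylinder, used to look for cluster structure.
df["dispercyl"] = df["disp"] / df["cyl"]
```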
Machine learning:
• Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data. 
• Share your insights about the difference in using these two methods.
K Means

observation:

Using the elbow method we can determine the optimal number of clusters, i.e. the point after which the average distortion starts decreasing in a linear fashion. For the given data, we conclude that the optimal number of clusters is 4.
Checking the effectiveness of the clustering using silhouette analysis.

Observation:

With k = 4 clusters, the silhouette coefficient values for each cluster are above the average silhouette score.
The thickness of each silhouette plot is almost uniform, hence k = 4 can be taken as optimal.
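A sketch of the elbow and silhouette checks with scikit-learn; synthetic 2-D blobs stand in for the scaled car attributes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: four well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 3, 6, 9)])

# Elbow method: plot inertia (average distortion) against k and look for
# the point where the decrease becomes linear.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 8)}

# Silhouette analysis for the chosen k: values closer to 1 mean
# better-separated clusters.
km4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, km4.labels_)
```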
Hierarchical clustering
There are different methods of computing the distance between clusters; here we have used 3 methods, and the results show only slight, acceptable variation.

The cophenetic correlation for a cluster tree is defined as the linear correlation coefficient between the cophenetic distances obtained from the tree, and the original distances (or dissimilarities) used to construct the tree.

The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined. That height is the distance between the two subclusters that are merged by that link.

Observations: Optimal clusters

To find the optimal number of clusters, look for the clusters with the longest branches; the branch length indicates how well separated the clusters are. As a rule of thumb, the longer the branches, the better: the shorter they are, the more similar the clusters are to the following ‘twigs’ and ‘leaves’.

Here we can observe that by drawing a line close to distance 3, we obtain 4 clusters.
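A sketch of the dendrogram-based workflow with SciPy, including the cophenetic correlation described above; the synthetic blobs are a stand-in for the real data:

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

# Synthetic stand-in data: four well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 4, 8, 12)])

Z = linkage(X, method="ward")       # agglomerative cluster tree
c, _ = cophenet(Z, pdist(X))        # cophenetic correlation coefficient

# Cutting the dendrogram to obtain 4 clusters.
labels = fcluster(Z, t=4, criterion="maxclust")
```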
K-Means clustering vs. Hierarchical clustering:
• The k-means algorithm iteratively tries to find centroids; hierarchical methods can be either divisive or agglomerative.
• K-means is parameterised by the value k; in hierarchical clustering it is not mandatory to specify the number of clusters up front.
• In k-means, the user needs to choose k before running the algorithm; in hierarchical clustering, the user can stop at any number of clusters by interpreting the dendrogram.
• In k-means, the mean (or median) is used as the cluster centre to represent each cluster; agglomerative clustering sequentially merges similar clusters until only one cluster remains.
• In every k-means iteration a point may shift clusters; in hierarchical clustering, clusters are formed from the computed distances and points do not shift.
• K-means is implicitly based on pairwise Euclidean distances between data points; in hierarchical clustering, different distance metrics can be used.

Answer below questions based on outcomes of using ML based methods.

• Mention how many optimal clusters are present in the data and what could be the possible reason behind it.
• Use linear regression model on different clusters separately and print the coefficients of the models individually
• How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.
Applying k-means with the determined number of clusters and preparing the data for linear regression.
Comparing the main dataset with the clustered datasets, clustering has a negative impact on the LinearRegression score, whereas the expectation was a better score after clustering. Perhaps gathering more data would help solve this problem.
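The per-cluster regression step can be sketched as follows; the synthetic features and target are assumptions standing in for the prepared car data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: three separated groups sharing a linear target.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 3)) for c in (0, 5, 10)])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=len(X))

# Assign each row to a cluster, then fit one regression per cluster
# and print its coefficients individually.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
scores = {}
for k in sorted(set(labels)):
    m = labels == k
    lr = LinearRegression().fit(X[m], y[m])
    print(f"cluster {k}: coefficients = {lr.coef_.round(2)}")
    scores[k] = lr.score(X[m], y[m])
```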

Improvisation:

Other parameters could be collected to predict the mpg column better: fuel injector type, an aerodynamic drag indicator, and oil usage would all be helpful.
Usage of cargo racks on top of the car would also be helpful information: such a rack increases aerodynamic drag and lowers fuel economy.
Towing a trailer or carrying excessive weight decreases fuel economy; an indicator of this would also be helpful.
Using 4-wheel drive reduces fuel economy: four-wheel-drive vehicles are tested in 2-wheel drive, and engaging all four wheels makes the engine work harder and increases transfer case and differential losses. An indicator of 2-wheel vs 4-wheel drive would also be helpful.
The number of gears would also help in determining the mpg of the vehicle.

Part Two

DOMAIN: Manufacturing

CONTEXT: Company X curates and packages wine across various vineyards spread throughout the country.

DATA DESCRIPTION: The data concerns the chemical composition of the wine and its respective quality.

PROJECT OBJECTIVE: Goal is to build a synthetic data generation model using the existing data provided by the company.

Steps and tasks:
1. Design a synthetic data generation model which can impute values [Attribute: Quality] wherever the company has missed recording the data.
There are 18 rows with Quality as null/NaN that have to be imputed.
The selected k values are fine and acceptable.
The WSS (within-cluster sum of squares) shows the same result.
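One way to sketch such an imputation is with scikit-learn's KNNImputer, which fills a missing Quality value from its k nearest rows in feature space; the columns and numbers here are illustrative, not the company's data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative wine-like sample with one missing quality value.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 9.5, 11.0],
    "pH":      [3.51, 3.20, 3.26, 3.30, 3.40],
    "quality": [5.0, 5.0, np.nan, 5.0, 6.0],
})

# Each NaN is replaced by the mean of its k nearest neighbours' values.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                       columns=df.columns)
```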

Part Three

DOMAIN: Automobile

CONTEXT: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

DATA DESCRIPTION:
The data contains features extracted from the silhouettes of vehicles at different angles. Four "Corgie" model vehicles were used for the experiment: a double decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the two cars.

All the features are numeric i.e. geometric features extracted from the silhouette.

PROJECT OBJECTIVE:
Apply dimensionality reduction technique – PCA and train a model using principal components instead of training the model using just the raw data

Data: Import, clean and pre-process the data
Only the class column is of object data type; the rest are either float or int.

EDA and visualisation:

Checking if there are any outliers.
There are no outliers observed.
Classifier:

Design and train a best-fit SVM classifier using all the data attributes.

Dimensionality reduction: perform dimensionality reduction on the data.

Classifier: Design and train a best-fit SVM classifier using the dimensionally reduced attributes.

Conclusion: Showcase key pointers on how dimensionality reduction helped in this case.
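The whole workflow above can be sketched as a pair of scikit-learn pipelines; the iris dataset is a stand-in for the vehicle silhouette features:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: iris replaces the silhouette features.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# SVM on all raw attributes vs. SVM on principal components that
# retain 95% of the variance.
raw_svm = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
pca_svm = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC()).fit(X_tr, y_tr)

raw_acc = raw_svm.score(X_te, y_te)
pca_acc = pca_svm.score(X_te, y_te)
```

Comparing `raw_acc` and `pca_acc` shows how much accuracy the reduced representation retains while using fewer input dimensions.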

Part Four

DOMAIN: Sports management

CONTEXT: Company X is a sports management company for international cricket.

DATA DESCRIPTION: The data collected belongs to batsmen from the IPL series conducted so far.

Attribute Information:

  1. Runs: Runs scored by the batsman
  2. Ave: Average runs scored by the batsman per match
  3. SR: strike rate of the batsman
  4. Fours: number of boundary/four scored
  5. Six: number of boundary/six scored
  6. HF: number of half centuries scored so far

PROJECT OBJECTIVE: Goal is to build a data driven batsman ranking model for the sports management company to make business decisions.

Steps and tasks:

1. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns using all possible methods.

2. Build a data driven model to rank all the players in the dataset using all or the most important performance feature
There are no null values observed.
EDA and visualisation:

Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods

Except for the Name column, all columns are numerical.
The data distribution for Runs seems to be a bit right skewed.
75% of the values in the Runs column are below 330.
75% of the values in the Fours column are below 28.
75% of the values in the Sixes column are below 10.
The HF column has a good positive correlation with Runs, Fours and Sixes.
Runs has a good positive correlation with Fours and Sixes.
The data distribution is normal for the Runs column and right skewed for HF, Sixes, Fours and Ave, which signifies there are a few very good players in the dataset.
As mentioned above, we see good correlation among a few variables.
More players have scored less than 300 runs.
The average score seems to be around 40.
Very few players have scored more than 30 fours.
Very few sixes have been scored by the players.
The frequency of half-centuries is very low.
The above plot gives a nice view of the distribution of the data.
A few outliers are observed in all the columns.
These could be genuine observations: as seen in the data analysis, there are a few very good players whose scores are higher.
Observing the above 3 graphs, we can conclude that there are 2 or 3 good clusters.
We shall consider 2 clusters, since a sharp elbow is observed in the graphs.
As observed in the printed table, players are ranked based on the combination of values in the columns.
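A minimal sketch of such a ranking model: scale each performance metric to [0, 1] and combine them into a composite score (equal weights here; the player names and numbers are made up):

```python
import pandas as pd

# Illustrative player data; the real model would use all relevant metrics.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Runs": [500, 300, 100],
    "Ave":  [50.0, 30.0, 20.0],
    "SR":   [140.0, 120.0, 110.0],
})

metrics = ["Runs", "Ave", "SR"]
# Min-max scale each metric so they are comparable, then average.
scaled = (df[metrics] - df[metrics].min()) / (df[metrics].max() - df[metrics].min())
df["score"] = scaled.mean(axis=1)
df["rank"] = df["score"].rank(ascending=False).astype(int)
```

Unequal weights could be used instead of the plain mean if the company values some metrics (e.g. strike rate) more than others.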

Part Five

Questions:

1. List down all possible dimensionality reduction techniques that can be implemented using python.
2. So far you have used dimensionality reduction on numeric data. Is it possible to do the same on multimedia data [images and video] and text data? Please illustrate your findings using a simple implementation in Python.
Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.

Dimensionality reduction techniques

Missing Value Ratio:

If the dataset has too many missing values, we can drop the variables having a large number of missing values. Data columns with a number of missing values greater than a given threshold can be removed.

Low Variance Filter:

Data columns with variance lower than a given threshold are removed. We apply this approach to identify and drop constant or near-constant variables from the dataset.
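scikit-learn's VarianceThreshold implements this filter directly; a small sketch with a constant middle column:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# The middle column is constant (zero variance) and gets dropped.
X = np.array([[1.0, 5.0, 0.1],
              [2.0, 5.0, 0.2],
              [3.0, 5.0, 0.3]])

X_filtered = VarianceThreshold(threshold=0.0).fit_transform(X)
```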

High Correlation Filter:

A pair of variables with high correlation increases multicollinearity in the dataset, so we can use this technique to find highly correlated features and drop one of each such pair.
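A common pandas sketch of this filter: compute the absolute correlation matrix, scan its upper triangle, and drop one column of each highly correlated pair (threshold 0.9 here; the data is illustrative):

```python
import numpy as np
import pandas as pd

# Column "b" is almost exactly 2 * "a", so one of that pair should go.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.1, 4.0, 6.2, 8.1],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)
```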

Random Forest :

Here the idea is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features.

Backward Feature Elimination:

At a given iteration, the algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features, n times. The input feature whose removal produces the smallest increase in the error rate is removed. By selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance.

Forward Feature Selection :

This is the inverse of Backward Feature Elimination: we start with 1 feature only and progressively add 1 feature at a time.

Both Backward Feature Elimination and Forward Feature Selection are quite time-consuming and computationally expensive.
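Both directions are available in scikit-learn's SequentialFeatureSelector; a sketch with iris as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Forward selection: start empty and add the best feature at each step.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",   # "backward" gives backward feature elimination
).fit(X, y)

X_selected = sfs.transform(X)
```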

Factor Analysis:

When we have highly correlated set of variables, this technique can be applied. It divides the variables based on their correlation into different groups, and represents each group with a factor.

Principal Component Analysis:

This is one of the most widely used techniques for dealing with linear data. It is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. 

Independent Component Analysis:

It is a statistical and computational technique for revealing hidden factors that underlie sets of random variables. Unlike principal component analysis, which focuses on maximizing the variance of the data points, independent component analysis focuses on independence, i.e. independent components.

ISOMAP:

Isomap is a nonlinear dimensionality reduction method and one of several widely used low-dimensional embedding methods. In this method, we determine the neighbors of each point, construct a neighborhood graph, compute the shortest paths between nodes, and compute the lower-dimensional embedding.

t-SNE:

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. This technique also works well when the data is strongly non-linear.

UMAP:

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. 

References:

https://en.wikipedia.org
https://www.kdnuggets.com/2015/05/7-methods-data-dimensionality-reduction.html

PCA for multimedia data

Reference : https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html

The dataset we plan to use contains handwritten digits from 0 to 9. We would like to group the images such that the handwritten digits on the images are the same.
The plan is to use PCA to reduce the dimensionality to a lower dimension so that clustering can be applied. This is also helpful for visualisation.
89% of the variance is explained by 20 components; the original dataset had 64 features.
PCA is applied to the image dataset, which has many features defining a single digit; using PCA, these can be reduced to a much lower number, and the reduced data can then be used for clustering.
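Following the scikit-learn example referenced above, a compact sketch of PCA plus k-means on the digits images:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each 8x8 handwritten-digit image is a 64-feature vector.
X, y = load_digits(return_X_y=True)

pca = PCA(n_components=20).fit(X)
X_reduced = pca.transform(X)
explained = pca.explained_variance_ratio_.sum()  # fraction of variance kept

# Cluster the reduced data into 10 groups, one per digit.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)
```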